Abstract

Discriminating data classes emanating from sensors is an important problem with many applications in science and technology. We describe a new transform for pattern identification that interprets patterns as probability density functions, and has special properties with regard to classification. The transform, which we denote as the Cumulative Distribution Transform (CDT), is invertible, with well-defined forward and inverse operations. We show that it can be useful in ‘parsing out’ variations (confounds) that are ‘Lagrangian’ (displacement and intensity variations) by converting these to ‘Eulerian’ (intensity variations) in transform space. This conversion is the basis for our main result, which describes when the CDT allows for linear classification in transform space. We also describe several properties of the transform and show, with computational experiments that used both real and simulated data, that the CDT can help render a variety of real-world problems simpler to solve.
1. Introduction

Mathematical transforms are useful tools in engineering, physics, and mathematics given that they can often render certain problems easier to solve in transform space. Fourier transforms [21], for example, are well known for providing simple answers related to the analysis of linear time-invariant systems. Wavelet transforms, on the other hand, are well suited for detecting and analyzing signal transients (fast changes) [29]. These and other transforms have been instrumental in the design of sampling and reconstruction algorithms for analog-to-digital conversion, modulation and demodulation, compression, communications, etc., and have found numerous applications in science and technology.

On the other hand, the past few decades have brought about the emergence of ubiquitous, accurate, user-friendly, and low-cost digital sensing devices. These devices produce a wealth of data about the world we live in, ranging from digital microscopy images of sub-cellular patterns to satellite imagery and detailed telescope images of our universe. The relative ease with which vast amounts of data can be accessed and queried for information has brought about challenges related to ‘telling signals apart’, or sensor data classification. Examples include being able to distinguish between benign and malignant tumors from medical images [19], between ‘normal’ and ‘abnormal’ physiological sensor data (e.g. flow cytometry) [34], identifying people from images of faces or fingerprints [?], identifying biological/chemical threats from resonant optical spectra [16], and others. The high-dimensional nature of the measurements in relation to the number of samples available often makes these problems challenging.

Important practical questions often arise in the process of designing solutions to many data classification problems. Examples would be: “Which features should be extracted?”, “What classifier should be used?”, “How can one model, visualize and understand any discriminating variations in the dataset?”, etc. For many applications where optimal feature sets are yet to be discovered, researchers are faced with the task of utilizing a trial-and-error approach that involves testing different combinations of features [?], classifiers, and kernels [33] in the effort to arrive at a useful solution of the problem. We note that many of the available signal transforms (Wavelet, Fourier, Hilbert, etc.) are linear transforms, and thus offer limited capabilities related to enhancing or facilitating separation in feature (transform) space unless some non-linear operations are performed.

Here we describe a new one-dimensional signal transformation framework, with well-defined analysis (forward transform) and synthesis (inverse transform) operations that, for signals that can be interpreted as probability density functions, can help facilitate the problem of recognition. Denoted as the Cumulative Distribution Transform (CDT), it can be viewed as a one-to-one mapping between the space of smooth probability densities and the space of differentiable functions, and therefore by definition retains all of the signal information. We show that the CDT can be computed efficiently, and is able to render certain types of classification problems linearly separable in the transform space. In contrast to linear data transformation frameworks (e.g. Fourier and Wavelet transforms), which simply consider signal intensities at fixed coordinate points, thus adopting an ‘Eulerian’ point of view, the idea behind the CDT is to also consider the location of the intensities in a signal, with respect to a chosen reference, in the effort to ‘simplify’ pattern recognition problems. Thus, the CDT adopts a ‘Lagrangian’ point of view for analyzing signals. The idea is similar to our work on linear optimal transport [?], and the links will be explicitly elucidated below.

1.1. Signal discrimination problems: 

Let $P$ and $Q$ denote two disjoint classes of functions (signals) within a normed vector space $V$. The goal in classification is to deduce a functional to ‘regress’ a given label for each signal [3]. For a binary classification problem, the label of each signal can be considered 0/1 or -1/+1, and the problem of classifying a signal $f$ can be solved by finding a linear functional $T : V \to \mathbb{R}$ and $b \in \mathbb{R}$ such that

$$T(f) < b \quad \forall f \in P, \qquad T(f) > b \quad \forall f \in Q. \tag{1}$$

Below we specifically consider the case when $T$ is a linear classifier in $V$. For example, for real functions in $L^2$ one may find $w$ such that $T(f) = \int_Y w(x) f(x)\,dx$. For discrete signal data in a countable domain $Z$ one may find $w$ such that $T(f) = \sum_{k \in Z} w[k] f[k]$. Thus the goal is to obtain the linear function $w$ and the scalar $b$ from labeled data. In practice, linear classifiers are important given their efficient implementation and favorable bias-variance trade-off, especially in classification of high-dimensional data [?].

The new signal data transformation framework described in this manuscript renders certain classification problems linearly separable in the transform space. Linear separability in the transform space gains practical importance with datasets that contain a small number of high-dimensional signals. When the number of signals available for training is far smaller than their dimension, nonlinear classifiers become prone to overfitting. This is a well-known effect, addressed in the literature as the problem of high dimension and low sample size (HDLSS) [20]. In addition, the overall variance of a classifier increases as the classifier becomes more complex [28], and simpler classifiers (e.g. linear) can often yield higher accuracies than more sophisticated ones [?]. Transforming the data and rendering it linearly separable will help maintain small classification error, balance the bias/variance trade-off, streamline the implementation of classification systems in many real-world problems, and could bypass the often time-consuming process of devising large sets of specially tailored numerical signal descriptors and testing each descriptor with various classifiers.

Figure 1: Two types of textures under illumination variation and their corresponding intensity histograms.

1.2. An illustrative example: 

Consider the problem of discriminating images of two different image patterns. The first column of Figure 1 contains two sample images from the UIUC Texture dataset [24], with the intensity histograms of the corresponding textures appearing directly above or beneath each texture. Now consider the same texture images, but under different brightness (which causes a translation of the histograms) and linear contrast (which causes a scaling of the histograms). Such variations in brightness and contrast are displayed in the different columns of Figure 1. A generative model for the histogram data corresponding to each texture class under brightness and contrast variations can be built by translation and scaling of the histograms. In other words, we generate sets of histograms $\{p_i\}$ and $\{q_j\}$, belonging to class $P$ and $Q$ respectively, by appropriately scaling ($a$) and translating ($\mu$) ‘prototype’ signals $p_0$ and $q_0$, such that $p_i(x) = p_0(a_i(x - \mu_i))$ and $q_j(x) = q_0(a_j(x - \mu_j))$. Finally, we note that stationary additive noise in these images can be modeled as a convolution of each signal $p_i$ or $q_j$ with the appropriate probability density of the noise model.
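The generative model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prototype `p0` is a hypothetical Gaussian bump standing in for a texture histogram, and the helper name `make_sample` and the parameter ranges are not taken from the paper.

```python
import math
import random


def p0(x):
    """Hypothetical prototype signal (a Gaussian bump), used only for illustration."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def make_sample(prototype, a, mu):
    """Return p(x) = prototype(a * (x - mu)): a scaled and translated copy."""
    return lambda x: prototype(a * (x - mu))


rng = random.Random(0)                        # fixed seed for reproducibility
grid = [-10.0 + 0.05 * i for i in range(401)]  # sampling grid on [-10, 10]

samples = []
for _ in range(5):
    a = rng.uniform(0.5, 2.0)    # contrast-like scaling (assumed range)
    mu = rng.uniform(-3.0, 3.0)  # brightness-like translation (assumed range)
    p_i = make_sample(p0, a, mu)
    samples.append((a, mu, [p_i(x) for x in grid]))
```

As written in the text, $p_0(a(x - \mu))$ is not renormalized; multiplying by $a$ would keep unit mass, which is the convention the scaling property later in the paper uses.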

In order to illustrate the main difficulty with utilizing linear classification methods under these sources of signal variation, we attempted to train a linear classifier on a set of histograms under random brightness ($\mu$) and contrast ($a$). We used the well-known Fisher Linear Discriminant Analysis (Fisher LDA) method [3], which seeks to maximize the differences in the projected mean of each class while at the same time minimizing their intra-class variances. We also generated a testing set by again applying the same random brightness and contrast model to the image data. Table 1 contains both the average training and testing error of 5-fold cross validation when using this simulated data model. It is clear that while the training error is very low, the resulting linear classifier fails to generalize to test data not used in training. We note that there is nothing special about the use of the Fisher LDA criterion in solving for $w$ in this example. That is, similar results are obtained utilizing linear Support Vector Machines instead (see Table 1).

Simple consideration of the structure of the problem reveals why it is hard for a linear classifier to fit the testing dataset. A single $w$, a linear classifier, is unable to ‘cope’ with the translation and scaling variations encountered in the test data $p_0(a_{ts}(x - \mu_{ts}))$. In other words, the operation $\int_Y w(x) p_0(a_{ts}(x - \mu_{ts}))\,dx$ fails to satisfy equation (1) for the randomly selected $a_{ts}$ and $\mu_{ts}$ used to generate the test set. To be clear, it is well known that, for a training set of fixed size, and for data of large enough dimension, a linear classifier $w$ can always be found that will near perfectly separate the training data [36]. However, as this simple simulation is meant to clarify, such a classifier may fail to generalize to testing data if $w$ fails to capture anything meaningful about the mathematical generative model of the problem. This is the phenomenon exemplified here.

Classifier type | Dataset      | $L^2$ space | CDT space
Fisher LDA      | Training set | 0 %         | 0 %
Fisher LDA      | Testing set  | 56.36 %     | 0.84 %
PLDA            | Training set | 41.81 %     | 0 %
PLDA            | Testing set  | 44.39 %     | 0 %
Linear SVM      | Training set | 57.02 %     | 0.20 %
Linear SVM      | Testing set  | 50.06 %     | 1.60 %

Table 1: Average Classification Error of the texture dataset

Now, the histograms in this problem could be rendered linearly separable if, for any input histogram, one could simply ‘mod out’ the translation and scaling parameters, thus removing the confounding variations that render the problem not linearly separable. This is the intuition behind the Cumulative Distribution Transform (CDT). It is able to handle variations such as translation, scaling, and others by computing rearrangements in the locations of the signal intensities with respect to a chosen reference, which does not require the estimation of the prototype histograms $p_0$ and $q_0$. Results in Table 1 show that the same Fisher LDA and SVM techniques, when applied to data that have been transformed with the CDT, are able to separate the data almost perfectly. Below we offer a mathematical explanation for this phenomenon through the course of the development of the CDT.

The paper is organized as follows. Section 2 summarizes notation and preliminaries. We present the definition of the CDT in Section 3, then its properties in Section 4. The linear separability property in CDT space is presented in Section 5, and a numerical method for approximating the forward CDT for discrete signals is described in Section 6. Finally, in Section 7, we present computational examples that show the CDT can significantly increase classification accuracy compared to simply treating signals in $L^2$ space.

2. Notation and preliminaries 

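Consider two probability spaces $(X, \Sigma(X), \mathcal{I}_0)$ and $(Y, \Sigma(Y), \mathcal{I}_1)$, where $X$ and $Y$ are connected sets in $\mathbb{R}^d$, $\Sigma(A)$ refers to the $\sigma$-algebra of the measurable set $A$, and $\mathcal{I}_0$ and $\mathcal{I}_1$ are probability measures, i.e. $\mathcal{I}_0(X) = 1$, $\mathcal{I}_1(Y) = 1$. Furthermore, let $\mathcal{I}_0(A) > 0$, $\mathcal{I}_1(A) > 0$ for any Lebesgue measurable set $A$ with $\lambda(A) > 0$, and let $I_0$ and $I_1$ denote the density functions associated with $\mathcal{I}_0$ and $\mathcal{I}_1$, respectively: $d\mathcal{I}_0(x) = I_0(x)\,dx$, $d\mathcal{I}_1(x) = I_1(x)\,dx$. Let $f_1 : X \to Y$ define a measurable map that pushes $\mathcal{I}_0$ onto $\mathcal{I}_1$ such that

$$\int_{f_1^{-1}(A)} d\mathcal{I}_0(x) = \int_{A} d\mathcal{I}_1(y) \tag{2}$$

for any Lebesgue measurable $A \subset Y$.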


In our case, we will consider $d = 1$ and $\mathcal{I}_0$ and $\mathcal{I}_1$ that have densities as defined above. In this case, the relation above can be expressed, through Lebesgue integration, as

$$\int_{\inf(X)}^{x} I_0(\tau)\,d\tau = \int_{\inf(Y)}^{f_1(x)} I_1(\tau)\,d\tau. \tag{3}$$

In addition, certain results shown below will require us to interpret measurable densities $I_0, I_1$ and maps $f_1, f_2$ as elements of function spaces. That is, given a measurable map $f_1 : X \to Y$ defined as above, for example, we can view it as an element of the space of functions whose absolute square value is Lebesgue integrable. In this case, the space is denoted as $L^2(X)$ and is defined as the set of functions that satisfy

$$\int_X |f_1(x)|^2\,d\lambda(x) < \infty,$$

with $\lambda$ referring to the Lebesgue measure in $X$.

3. The 1D Cumulative Distribution Transform

Consider two probability density functions $I_0$ and $I_1$ defined as above. Considering $I_0$ to be a pre-determined ‘reference’ density, one can use relation (3) to uniquely associate $f_1$ with a given density $I_1$. We use this relationship to define the Cumulative Distribution Transform (CDT) of $I_1$ (denoted as $\hat{I}_1 : X \to \mathbb{R}$), with respect to the reference $I_0$:

$$\hat{I}_1(x) = (f_1(x) - x)\sqrt{I_0(x)}, \tag{4}$$

with $f_1 : X \to Y$ satisfying (3) for $x \in X$.

Now let $J_0 : X \to [0,1]$ and $J_1 : Y \to [0,1]$ be the corresponding cumulative distribution functions for $I_0$ and $I_1$, that is: $J_0(x) = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau$, $J_1(x) = \int_{\inf(Y)}^{x} I_1(\tau)\,d\tau$. With $f_1$ defined in (3) one can rewrite $J_0 : X \to [0,1]$ as

$$J_0(x) = J_1(f_1(x)). \tag{5}$$

For continuous cumulative distribution functions $J_0$ and $J_1$ (functions whose first derivative exists throughout their respective domains), $f_1$ is a continuous and monotonic function. If $f_1$ is differentiable, (5) can be rewritten as

$$I_0(x) = f_1'(x)\, I_1(f_1(x)). \tag{6}$$

For measurable but discontinuous functions the relationship above does not hold at points of discontinuity.
The inverse Cumulative Distribution Transform of $\hat{I}_1$ is defined as:

$$I_1(y) = \frac{d}{dy} J_0(f_1^{-1}(y)) = (f_1^{-1})'(y)\, I_0(f_1^{-1}(y)), \tag{7}$$

where $f_1^{-1} : Y \to X$ refers to the inverse of $f_1$ (i.e. $f_1^{-1}(f_1(x)) = x$), and $f_1(x) = \hat{I}_1(x)/\sqrt{I_0(x)} + x$. Naturally, formula (7) holds for points where $J_0$ and $f_1$ are differentiable. By the construction above, $f_1$ will be differentiable except at points where $I_0$ and $I_1$ are discontinuous. Note that in practice we have control over the definition of $I_0$, and in our numerical implementation described in Section 6 we take it to be the uniform density. The example presented below shows the CDT of a normal distribution density.



Figure 2: Example 3.1


Example 3.1. Consider the probability density of the uniform distribution $I_0 : [0,1] \to \mathbb{R}$,

$$I_0(x) = 1,$$

and a normal distribution density $I_1 : \mathbb{R} \to \mathbb{R}$ with zero mean and unit variance (see Figure 2):

$$I_1(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

Note that $\int_{-\infty}^{\infty} I_1(\tau)\,d\tau = \int_0^1 I_0(\tau)\,d\tau = 1$ holds by definition. To find the CDT of $I_1$ with respect to the reference $I_0$, we first solve for $f_1 : [0,1] \to \mathbb{R}$:

$$\int_{-\infty}^{f_1(x)} I_1(\tau)\,d\tau = \int_0^x I_0(\tau)\,d\tau = x. \tag{8}$$

By setting $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\tau^2/2}\,d\tau$, (8) can be rewritten as

$$\Phi(f_1(x)) = x.$$

$\Phi(x)$ is a monotonically increasing function, and its inverse exists. Hence, we get

$$f_1(x) = \Phi^{-1}(x). \tag{9}$$

By substituting (9) into (4) we have found the CDT, $\hat{I}_1 : [0,1] \to \mathbb{R}$:

$$\hat{I}_1(x) = \Phi^{-1}(x) - x. \tag{10}$$

Figure 2 shows the plot (black dotted line) of the CDT of a normal distribution density function with zero mean and unit variance.
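As a quick numerical companion to Example 3.1, the closed form (10) can be evaluated with Python's standard library, whose `statistics.NormalDist.inv_cdf` provides $\Phi^{-1}$. The function name `cdt_standard_normal` and the sample points are illustrative choices, not part of the paper.

```python
from statistics import NormalDist


def cdt_standard_normal(x):
    """CDT of N(0, 1) w.r.t. the uniform reference on [0, 1]: Phi^{-1}(x) - x."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(x) - x


xs = [0.1, 0.25, 0.5, 0.75, 0.9]
values = [cdt_standard_normal(x) for x in xs]
```

The transform is unbounded as $x$ approaches 0 or 1, which is why the sketch rejects the endpoints.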


4. CDT Properties 

Here we describe a few basic properties of the CDT, with the main 
purpose of elucidating certain of its qualities necessary for understanding 
its ability to linearly separate particular types of densities. 


Property 4.1. Nonlinearity. The CDT is a non-linear transformation.

For a transformation $A$ to be linear, we must have that $A(\alpha I_1 + \beta I_2) = \alpha A(I_1) + \beta A(I_2)$. It is easy to check using Example 3.1 that this relation does not hold. Suppose $\alpha = 1/2$, $\beta = 1/2$, $I_1$ is a normal density and $I_2$ is a uniform density. Then $A(\alpha I_1 + \beta I_2) \neq \alpha A(I_1) + \beta A(I_2)$.
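The counterexample in Property 4.1 can be checked numerically. The sketch below assumes a uniform reference on $[0,1]$, takes $I_1$ as the standard normal and $I_2$ as the uniform density on $[0,1]$, and inverts the mixture's cumulative function by bisection; the helper names are hypothetical.

```python
from statistics import NormalDist

_PHI = NormalDist()


def mixture_cdf(y):
    """CDF of 0.5 * N(0, 1) + 0.5 * Uniform[0, 1]."""
    uniform_cdf = min(max(y, 0.0), 1.0)
    return 0.5 * _PHI.cdf(y) + 0.5 * uniform_cdf


def invert_cdf(cdf, x, lo=-20.0, hi=20.0, iters=200):
    """Solve cdf(y) = x for y by bisection (cdf must be increasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cdt_of_mixture(x):
    """CDT of the mixture density: f(x) - x with I_0 = 1."""
    return invert_cdf(mixture_cdf, x) - x


def mixture_of_cdts(x):
    """0.5 * CDT(normal) + 0.5 * CDT(uniform); the uniform CDT is zero."""
    cdt_normal = _PHI.inv_cdf(x) - x
    cdt_uniform = 0.0  # the uniform density equals the reference, so f(x) = x
    return 0.5 * cdt_normal + 0.5 * cdt_uniform


x = 0.5
gap = abs(cdt_of_mixture(x) - mixture_of_cdts(x))
```

A clearly non-zero `gap` at a single point is enough to show that the transform is not linear.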


Before going on to state further properties of the CDT, it is worth expanding upon the geometric meaning of the CDT. We first note that, using the standard definition of the $L^2$ norm, i.e. $\|\hat{I}_1\|_{L^2} = \left(\int_X |\hat{I}_1(x)|^2\,dx\right)^{1/2}$, we have:

$$\|\hat{I}_1\|_{L^2}^2 = \int_X (f_1(x) - x)^2 I_0(x)\,dx. \tag{11}$$

As such, the quantity $\|\hat{I}_1\|_{L^2}^2$ computes the ‘amount’ of intensity from $I_0$ at coordinate $x$ that will be displaced to coordinate $f_1(x)$. Because $f_1$ is uniquely defined for nonzero probability densities, the quantity $\|\hat{I}_1\|_{L^2}^2$ can be viewed as the minimum amount of ‘effort’ (quantified as density intensity $\times$ displacement) that must be applied to ‘morph’ $I_1$ onto $I_0$. This quantity can be interpreted as the optimal transport (Kantorovich-Wasserstein) distance between $I_0$ and $I_1$ [38]. Moreover, the set of continuous density functions is formally a Riemannian manifold [38], meaning that at any point in probability density space there is a tangent space endowed with an inner product corresponding to the incremental intensity flow (see [40] for more details). Therefore the distance between $I_1$ and $I_0$ expressed in (11) can be interpreted as a geodesic distance over the associated manifold.

Now consider the distance between the CDTs of two different densities $I_1$ and $I_2$, computed with respect to the same reference $I_0$:

$$\|\hat{I}_1 - \hat{I}_2\|_{L^2}^2 = \int_X \left((f_1(x) - x) - (f_2(x) - x)\right)^2 I_0(x)\,dx, \tag{12}$$

where $f_1$ and $f_2$ correspond to the mappings between $I_1$ and $I_0$, and $I_2$ and $I_0$, respectively. In two or more dimensions, as described in [?], this distance can be thought of as the ‘linearized’ optimal transport (generalized geodesic) metric between density functions $I_1$ and $I_2$. It can be interpreted as an azimuthal equidistant projection of $I_1$ and $I_2$ onto the plane associated with the incremental intensity flows about the point $I_0$. For one-dimensional density functions, however, $f$ is uniquely determined. Hence the optimal transport distance computed between densities $I_1$ and $I_2$ can also be expressed through (12) above. In short, the CDT of a given probability density function $I_1$ can be viewed as an invertible embedding of the function onto a linear space that is isometric with respect to the standard optimal transport (also known as Earth Mover’s) distance.
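The isometry claim can be illustrated for two Gaussians, for which the quadratic optimal transport distance has the well-known closed form $\sqrt{(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2}$. The sketch below assumes a uniform reference on $[0,1]$ and approximates the integral in (12) with a midpoint rule; the function name and grid size are arbitrary choices.

```python
import math
from statistics import NormalDist


def cdt_distance(dist_a, dist_b, n=20000):
    """L2 distance between the CDTs of two distributions (uniform reference)."""
    total = 0.0
    for k in range(n):
        x = (k + 0.5) / n  # midpoint of the k-th cell of [0, 1]
        # (f_a(x) - x) - (f_b(x) - x) with f = inverse CDF, and I_0(x) = 1
        diff = dist_a.inv_cdf(x) - dist_b.inv_cdf(x)
        total += diff * diff / n
    return math.sqrt(total)


d = cdt_distance(NormalDist(0.0, 1.0), NormalDist(3.0, 2.0))
closed_form = math.sqrt((0.0 - 3.0) ** 2 + (1.0 - 2.0) ** 2)
```

The two numbers agree up to the quadrature error, which is dominated by the unbounded tails of $\Phi^{-1}$ near 0 and 1.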

We now describe important properties of the CDT operation relative to 
density coordinate changes such as translation, scaling, and more generally 
diffeomorphisms applied to a given density. 


Property 4.2. Translation. Let $I_\mu$ represent a translation of the probability density $I_1$ by $\mu$, $I_\mu(x) = I_1(x - \mu)$. The CDT of $I_\mu$ with respect to the reference probability density $I_0 : X \to \mathbb{R}$ is given by $\hat{I}_\mu : X \to \mathbb{R}$:

$$\hat{I}_\mu(x) = \hat{I}_1(x) + \mu\sqrt{I_0(x)}. \tag{13}$$

For a proof, see Appendix A.


Example 4.3. Consider a translation of the density function $I_1(x)$ in Example 3.1 by $\mu$:

$$I_\mu(x) = I_1(x - \mu) = \frac{1}{\sqrt{2\pi}}\, e^{-(x - \mu)^2/2}.$$

This is a normal distribution with mean $\mu$ and unit variance. The corresponding CDT, $\hat{I}_\mu : [0,1] \to \mathbb{R}$, for $I_\mu$ with respect to the uniform reference density $I_0 : [0,1] \to \mathbb{R}$ can be found by the translation property (13) and by the CDT found in (10):

$$\hat{I}_\mu(x) = \hat{I}_1(x) + \mu = \Phi^{-1}(x) - x + \mu,$$

which is the translation constant $\mu$ plus the CDT of the zero-mean normal distribution. Figure 3 is plotted for the case $\mu = 2$.

Figure 3: Example 4.3. (a) $I_\mu(x)$; (b) $\hat{I}_\mu(x)$.

Figure 4: Example 4.5. (a) $I_a(x)$; (b) $\hat{I}_a(x)$.


Property 4.4. Scaling. Let $I_a$ represent a scaling of the probability density $I_1$ by $a$, $I_a(x) = a I_1(ax)$. The CDT of $I_a$ with respect to the reference probability density $I_0 : X \to \mathbb{R}$ is given by $\hat{I}_a : X \to \mathbb{R}$:

$$\hat{I}_a(x) = \frac{\hat{I}_1(x) - x(a - 1)\sqrt{I_0(x)}}{a}. \tag{14}$$

For a proof, see Appendix B.


Example 4.5. Consider the density function $I_1(x)$ in Example 3.1 scaled by a factor $a$, such that

$$I_a(x) = a I_1(ax) = \frac{a}{\sqrt{2\pi}}\, e^{-(ax)^2/2}.$$

This is identical to a normal distribution with zero mean and standard deviation $1/a$. The corresponding CDT, $\hat{I}_a : [0,1] \to \mathbb{R}$, for $I_a$ with respect to the uniform reference density $I_0 : [0,1] \to \mathbb{R}$ can be found by the scaling property (14) and by the CDT found in (10):

$$\hat{I}_a(x) = \frac{\hat{I}_1(x) - x(a - 1)}{a} = \frac{\Phi^{-1}(x) - x - ax + x}{a} = \frac{\Phi^{-1}(x)}{a} - x.$$

Figure 4 plots this function for the case $a = 2$.
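Properties 4.2 and 4.4 can be sanity-checked against the inverse cumulative function of the translated and scaled Gaussians. This sketch assumes the uniform reference on $[0,1]$ (so $\sqrt{I_0(x)} = 1$); the helper name `cdt_uniform_ref` and the test points are illustrative.

```python
from statistics import NormalDist


def cdt_uniform_ref(dist, x):
    """CDT of `dist` w.r.t. the uniform reference on [0, 1]: F^{-1}(x) - x."""
    return dist.inv_cdf(x) - x


base = NormalDist(0.0, 1.0)   # I_1 of Example 3.1
mu, a = 2.0, 2.0              # values used in Figures 3 and 4
xs = [0.1, 0.3, 0.5, 0.7, 0.9]

# Translation: I_mu(x) = I_1(x - mu) is N(mu, 1); expected CDT is CDT(I_1) + mu.
translation_gap = max(
    abs(cdt_uniform_ref(NormalDist(mu, 1.0), x) - (cdt_uniform_ref(base, x) + mu))
    for x in xs
)

# Scaling: I_a(x) = a * I_1(a x) is N(0, (1/a)^2); expected CDT follows (14).
scaling_gap = max(
    abs(
        cdt_uniform_ref(NormalDist(0.0, 1.0 / a), x)
        - (cdt_uniform_ref(base, x) - x * (a - 1.0)) / a
    )
    for x in xs
)
```

Both gaps should vanish up to floating-point rounding.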


Property 4.6. Composition. Let $I_g : Z \to \mathbb{R}$ represent a probability density that has the following relation with the probability density $I_1 : Y \to \mathbb{R}$:

$$J_g(x) = J_1(g(x)).$$

Here $J_1 : Y \to \mathbb{R}$ and $J_g : Z \to \mathbb{R}$ represent the corresponding cumulative distributions for $I_1$ and $I_g$ respectively, and $g : Z \to Y$ is an invertible, differentiable function. The CDT of the corresponding density $I_g$ with respect to the reference probability density $I_0 : X \to \mathbb{R}$ is given by

$$\hat{I}_g(x) = \left( g^{-1}\!\left( \frac{\hat{I}_1(x)}{\sqrt{I_0(x)}} + x \right) - x \right) \sqrt{I_0(x)}.$$

See Appendix C for a proof. Property 4.6 summarizes one of the main characteristics of the CDT so far: it renders diffeomorphic transport changes ‘Eulerian’ in CDT space. In detail, in CDT space, a change in $\hat{I}_g$ at coordinate $x_0$ is affected only by the value at the same coordinate $x_0$, i.e. $\hat{I}_1(x_0)$. On the other hand, in $L^2$ space, a change in $I_g$ at coordinate $x_0$ is affected by the values at both coordinates $x_0$ and $g(x_0)$, i.e. $I_g(x_0) = g'(x_0) I_1(g(x_0))$.
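The composition property can be checked numerically for one concrete choice of $g$. The sketch below assumes the uniform reference on $[0,1]$, takes $I_1$ as the standard normal of Example 3.1, and picks $g(y) = y^3$ purely for illustration; it compares the closed-form expression with a direct bisection solve of $J_g(f_g(x)) = x$.

```python
import math
from statistics import NormalDist

PHI = NormalDist()


def g(y):
    """Illustrative invertible, differentiable map (an assumption, not from the paper)."""
    return y ** 3


def g_inv(u):
    """Real cube root, valid for negative arguments as well."""
    return math.copysign(abs(u) ** (1.0 / 3.0), u)


def f_g_direct(x, lo=-5.0, hi=5.0, iters=200):
    """Solve J_g(y) = Phi(g(y)) = x for y by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if PHI.cdf(g(mid)) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cdt_g_formula(x):
    """Property 4.6 with I_0 = 1: g^{-1}(CDT_1(x) + x) - x."""
    cdt_1 = PHI.inv_cdf(x) - x  # equation (10)
    return g_inv(cdt_1 + x) - x


xs = [0.1, 0.3, 0.7, 0.9]
composition_gap = max(abs((f_g_direct(x) - x) - cdt_g_formula(x)) for x in xs)
```

Note how the formula only uses the value of $\hat{I}_1$ at the same coordinate $x$, which is the ‘Eulerian’ behaviour described above.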
5. Linear Separability in CDT space

One of the main contributions of this paper is to describe how the CDT can enhance linear separability of signal classes. Before stating the main result regarding linear separation, a few preliminary results are necessary. As is well known, the linear separability of two sets in $\mathbb{R}^n$ is determined by the existence of a separating hyperplane. If two sets are convex and disjoint, a separating hyperplane always exists, and hence the sets are linearly separable. Furthermore, the converse holds when at least one set is an open set [6]. The Hahn-Banach Separation Theorem is a generalization of the separating hyperplane theorem to infinite-dimensional spaces.

Theorem 5.1 (Hahn-Banach Separation Theorem for Normed Vector Spaces). Let $P$ and $Q$ be nonempty, convex subsets of a real normed vector space $V$. Furthermore, assume $P$ and $Q$ are disjoint and that one is closed and the other is compact. Then, there exists a continuous linear functional $T$ on $V$ and $b \in \mathbb{R}$ that strictly separates the sets $P$ and $Q$ such that

$$T(p) < b < T(q), \quad \forall p \in P,\ \forall q \in Q. \tag{15}$$

For a non-zero linear functional $T$ and a real number $b$, a hyperplane $\mathcal{H}(T, b) = \{v \in V \mid T(v) = b\}$ can be defined, and a hyperplane that satisfies (15) is called a separating hyperplane. For a proof and more details on the Hahn-Banach separation theorem, please refer to [?]. For $L^2$ spaces, the Hahn-Banach Separation Theorem implies that there exists a unique linear classifier $w$ that linearly separates two convex sets. To derive this, we need the following theorem, which states that every linear functional $T$ on $L^2$ is of the form (16) for some $w \in L^2$.


Theorem 5.2. For every continuous linear functional $T$ on $L^2$ there is a unique $w \in L^2$ so that

$$T(f) = \int_X f(x) w(x)\,dx, \quad \forall f \in L^2. \tag{16}$$

In other words, there exists a separating hyperplane in $L^2$ space, $\mathcal{H}(w, b) = \{x \in X \mid w(x) = b\}$. For a proof and more details, please refer to [35]. Therefore, for a continuous linear functional $T$ on $L^2$, a unique $w$ can always be found. The following Lemma is a consequence of Theorem 5.1 and Theorem 5.2, which together state that there exists a linear classifier $w$ that can separate two disjoint, convex sets in $L^2$ space.
Lemma 5.3 (Linear Classifier for Convex Sets in $L^2$ Space). Let $P$ and $Q$ be nonempty, convex subsets of $L^2$ space, where $P$ and $Q$ are disjoint and one is closed and the other is compact. Then, there exists a continuous hyperplane $\mathcal{H}(w, b) = \{x \in X \mid w(x) = b\}$ that separates the sets $P$ and $Q$ such that

$$\int_X w(x) p_i(x)\,dx < b, \quad \forall p_i \in P, \qquad \int_X w(x) q_j(x)\,dx > b, \quad \forall q_j \in Q, \tag{17}$$

and $\mathcal{H}(w, b)$ is called a linear classifier.


So far, we have seen that a linear classifier always exists for two disjoint, 
convex sets in $L^2$, with one being compact and the other closed. Moreover,
the linear classifier would also linearly separate any subset pair from each 
convex hull of each convex set. In other words, two linearly separable convex 
sets imply that any subset pair from each convex hull is linearly separable, 
and vice versa. Therefore, in order to determine whether or not two sets 
are linearly separable, it suffices to show whether any subset pair from each 
convex hull is linearly separable. The following Lemma states this argument 
and will be used to show the main result of the paper. 


Lemma 5.4 (Linear Separation of Compact Convex Hulls of Convex Sets in $L^2$ Space). Two nonempty, compact subsets $P$ and $Q$ in $L^2$ space are linearly separable if and only if their convex hulls are disjoint, i.e. when the following holds:

$$\sum_{i=1}^{N_p} \alpha_i p_i \neq \sum_{j=1}^{N_q} \beta_j q_j \tag{18}$$

for any subsets $\{p_i\}_{i=1}^{N_p} \subset P$, $\{q_j\}_{j=1}^{N_q} \subset Q$, and for any $\alpha_i, \beta_j > 0$ that satisfy $\sum_i \alpha_i = \sum_j \beta_j = 1$.

For a proof, see Appendix D.


We now discuss the conditions under which the CDT can render classes of 
1-dimensional probability densities linearly separable. We begin by defining 
a generative model for classes P and Q. 


Definition 5.5. $\mathbb{H}$ is a set of monotonic and differentiable functions. $P$ and $Q$ are two disjoint sets satisfying

i) $h'(p_0 \circ h) \in P$, $h'(q_0 \circ h) \in Q$, $\forall h \in \mathbb{H}$, $p_0 \in P$, $q_0 \in Q$.
ii) $\forall p \in P$, $\forall q \in Q$, $p \neq q$ (disjoint).

Figure 5: Depiction of the linear separability properties of the CDT.

Note that in the definition above we have used the notation $p \circ h(x) = p(h(x))$. The definition provides a framework which one can use to construct (or interpret) signal classes. In more practical language, we envision signal classes as being generated from fundamental patterns, but with distortions or confounds applied to them. For example, let $p_0$ and $q_0$ be two distinct probability densities, which we denote as ‘mother’ densities. Furthermore, let $\mathbb{H}$ be composed of all translations: $h_\tau(x) = x - \tau$, with $\tau$ a random variable. Elements of the sets $P$ and $Q$ are thus $p_0 \circ h_\tau$ and $q_0 \circ h_\tau$, respectively, and can be viewed as translations of the original mother densities. In this case, the translation makes up the ‘nuisance’ (confound) parameter a classifier must decode to enable accurate separation of the classes. Note that we have used the translation case as an example here, and the model specified above allows for more complex classes to be created. We note that since $h \in \mathbb{H}$ is monotonic and differentiable, its inverse $h^{-1}$ exists and is also differentiable.

We now describe the main Theorem of this paper clarifying the linear 
separation properties of the newly proposed CDT. 


Theorem 5.6 (Linear Separability Theorem in CDT Space). Let $P$, $Q$, and $\mathbb{H}$ be defined according to Definition 5.5. In addition, let $h \in \mathbb{H}$ satisfy the following conditions:

i) $\forall h \in \mathbb{H}$, $h^{-1} \in \mathbb{H}$.
ii) $\forall h_i \in \mathbb{H}$ and $\alpha_i > 0$ that satisfy $\sum_i \alpha_i = 1$, $h_\alpha^{-1} = \sum_i \alpha_i h_i^{-1} \in \mathbb{H}$.
iii) $\forall h_1, h_2 \in \mathbb{H}$, $h_1 \circ h_2 \in \mathbb{H}$.

Then the corresponding sets in CDT space, $\hat{P}$ and $\hat{Q}$, are linearly separable.

We note that the linear separability theorem is independent of the choice of the reference $I_0$. For a proof, see Appendix E.
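A toy instance of the theorem, with translations as the confound, can be sketched as follows. The assumptions are loud: the mother densities are taken to be $N(0, 1)$ and $N(0, 2^2)$, the reference is uniform on $[0,1]$, and the weight function $w = \Phi^{-1}$ and threshold $b = 1.2$ are hand-picked for illustration rather than learned by any of the classifiers used in the paper.

```python
import random
from statistics import NormalDist

N_GRID = 2000
GRID = [(k + 0.5) / N_GRID for k in range(N_GRID)]  # midpoints of [0, 1]
Z = [NormalDist().inv_cdf(x) for x in GRID]         # Phi^{-1} sampled on the grid


def cdt_translated_normal(sigma, tau):
    """CDT (uniform reference) of N(tau, sigma^2): sigma * Phi^{-1}(x) + tau - x."""
    return [sigma * z + tau - x for z, x in zip(Z, GRID)]


def score(cdt):
    """Linear functional T(f) = integral of w(x) f(x) dx with w = Phi^{-1}."""
    return sum(w * v for w, v in zip(Z, cdt)) / N_GRID


rng = random.Random(0)
# Class P: translated copies of N(0, 1); class Q: translated copies of N(0, 4).
scores_p = [score(cdt_translated_normal(1.0, rng.uniform(-5, 5))) for _ in range(20)]
scores_q = [score(cdt_translated_normal(2.0, rng.uniform(-5, 5))) for _ in range(20)]
b = 1.2  # hand-picked threshold between the two score clusters
```

Because $w$ integrates to zero over $[0,1]$, the translation $\tau$ adds nothing to the score, so every member of a class lands on (numerically) the same value and a single threshold separates the two classes.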
6. Computational Algorithm

We now describe a numerical method for approximating the CDT given discrete data. Recall that the CDT is defined for continuous-time functions on a contiguous, finite domain. In order to compute the CDT for a discrete-time signal, we need a way of estimating its cumulative function at any arbitrary coordinate. We do so via interpolation. Given a discrete signal of $N$ points and an interpolating model, the forward CDT can be estimated numerically at all $N$ points. Our numerical method is designed for the case when the reference function is $I_0(x) = 1$ for $x \in [0,1]$ (recall that the linear separation properties of the CDT are independent of the choice of reference). The computation is formulated with the aid of B-splines [?]. We use the B-spline of degree zero, which guarantees that the reconstructed signals are always positive and yields a low-complexity algorithm ($O(N)$). We note that under the specific construction below the approximated density functions will be discontinuous at the halfway point between sampled nodes and, as stated above, reconstruction at these points is not possible.

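Let $\pi(x)$ be the B-spline of degree zero of width $r$,

$$\pi(x) = \begin{cases} 1 & x \in [-\tfrac{1}{2}r, \tfrac{1}{2}r] \\ 0 & \text{elsewhere} \end{cases}$$

and define $\Pi(x) = \int_{-\infty}^{x} \pi(\tau)\,d\tau$ as

$$\Pi(x) = \begin{cases} 0 & x < -\tfrac{1}{2}r \\ x + \tfrac{1}{2}r & x \in [-\tfrac{1}{2}r, \tfrac{1}{2}r] \\ r & x > \tfrac{1}{2}r. \end{cases} \tag{19}$$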


Let us denote an $N$-point discrete-time signal as $c = [c_1, \cdots, c_N]$ and $x_i$ as the sample location of $c_i$, i.e. $c(x_i) = c_i$, $\forall i = 1, \cdots, N$. We interpolate the discrete-time signal $c$ with the B-spline of degree zero to obtain a continuous-time signal $I_1(x) = \sum_{i=1}^{N} c_i \pi(x - x_i)$ for $x \in [x_1 - \tfrac{1}{2}r, x_N + \tfrac{1}{2}r]$. Rewriting (3), we have

$$\int_{-\infty}^{f_1(x)} I_1(\tau)\,d\tau = \int_{-\infty}^{f_1(x)} \sum_{i=1}^{N} c_i \pi(\tau - x_i)\,d\tau = x, \tag{20}$$

which can be simplified further by interchanging the sum and the integral, and then using $\Pi$ to denote the cumulative integral function of $\pi$, as

$$\sum_{i=1}^{N} c_i \Pi(f_1(x) - x_i) = x. \tag{21}$$


By substituting (19) into (21) and taking the inverse of $\Pi(x)$, which is piecewise linear, $f_1(x)$ is computed according to the following algorithm:

1. When $0 < x < r c_1$, we have $c_1 \Pi(f_1(x) - x_1) = x$. Thus,

$$f_1(x) = \frac{x}{c_1} + x_1 - \frac{1}{2}r.$$

2. When $r \sum_{i=1}^{n-1} c_i < x < r \sum_{i=1}^{n} c_i$, we have $\sum_{i=1}^{n} c_i \Pi(f_1(x) - x_i) = x$. Thus,

$$f_1(x) = \frac{x - r \sum_{i=1}^{n-1} c_i}{c_n} + x_n - \frac{1}{2}r. \tag{22}$$

3. Proceed until $n = N$.
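The three steps above can be sketched in Python as a single sweep over the cells, which keeps the cost linear in the number of samples plus evaluation points. This is a minimal sketch, not the authors' MATLAB implementation: the function name is hypothetical, and it assumes strictly positive samples, uniformly spaced sample locations whose spacing equals the spline width `r`, and evaluation points sorted in ascending order inside (0, 1).

```python
def forward_cdt_uniform_ref(c, xs, r, x_eval):
    """Approximate f_1 and the CDT for a piecewise-constant density.

    c      : strictly positive samples c_1..c_N
    xs     : sample locations x_1..x_N, assumed uniformly spaced by r
    r      : width of the degree-zero B-spline
    x_eval : ascending points in (0, 1) at which to evaluate the transform
    """
    if any(ci <= 0 for ci in c):
        raise ValueError("all samples must be strictly positive")
    total_mass = r * sum(c)
    c = [ci / total_mass for ci in c]  # normalise so the density integrates to 1

    f1 = []
    n = 0               # index of the current cell
    mass_before = 0.0   # r * sum of c_i over the cells before cell n
    for x in x_eval:
        # Advance to the cell whose cumulative-mass interval contains x.
        while n < len(c) - 1 and x > mass_before + r * c[n]:
            mass_before += r * c[n]
            n += 1
        # Equation (22): invert the linear piece of the cumulative function.
        f1.append((x - mass_before) / c[n] + xs[n] - 0.5 * r)

    cdt = [f - x for f, x in zip(f1, x_eval)]  # (f_1(x) - x) * sqrt(I_0), I_0 = 1
    return f1, cdt


f1_demo, cdt_demo = forward_cdt_uniform_ref(
    [1.0, 3.0], [0.25, 0.75], 0.5, [0.125, 0.625]
)
```

In the small demo the density is 0.5 on $[0, 0.5]$ and 1.5 on $[0.5, 1]$ after normalization, so its cumulative function reaches 0.125 at 0.25 and 0.625 at 0.75, which are the values of $f_1$ the sweep should return.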


7. Computational experiments 

In this section, we experimentally evaluate the properties of the CDT 
by comparing linear classification performed in CDT space with that in 
original signal space (L^). Specifically, we investigate five cases of signal 
classification; classihcation of texture images from histograms, classification 
of accelerometer signals, classification of flow cytometry data, classification 
of histograms from hand gesture image data, and classification of cell images 
from orientation histograms. Note that our goal is not to propose the
ultimate, or optimal, classification method for each application, but rather
to experimentally validate Theorem 5.6 using both simulated (manufactured)
data and diverse, real datasets. We note that, with the exception of the
simulated case, we have no precise knowledge of whether conditions i), ii),
and iii) for $\mathbb{H}$ specified in Theorem 5.6 hold. Results seem to
confirm, however, that the generative model specified in these conditions
has at least some bearing on each problem investigated here.

As the CDT does not actually prescribe an optimal classifier, we quantify
the degree of linear separability of the data by computing the classification
error of three different linear classifiers, using a standard cross validation
procedure (or leave-one-out cross validation when the data size is small) that
separates training and testing data. In addition, we provide more qualitative
(visual) evidence, by computing a low dimensional projection of the data
using training data only, that the CDT indeed tends to make data more
linearly separable.

7.1. Experimental procedure

Average classification error is compared using three different linear
classifiers: Fisher’s linear discriminant analysis (Fisher LDA) method [3],
the penalized LDA (pLDA) method of Wang et al. [39], and the linear Support
Vector Machine (SVM) method [9]. All experiments were performed using
the MATLAB [30] programming language, while the SVM method was
implemented using the LIBSVM package [7]. While the Fisher LDA method
does not require parameter tuning, the linear SVM and pLDA methods
require parameter tuning steps, which were performed using a 2nd depth
(nested) cross validation utilizing the training set only. In the SVM method,
the parameter is set to reflect how much error the separating hyperplane is
to tolerate, while the parameter in the pLDA method determines the
regularization to be applied when computing the covariance matrix (refer to
references [9, 39] for more details).

The low dimensional visualization plots were computed using the pLDA
method, which in contrast to the standard LDA method can yield
multi-dimensional embeddings for the given data. The dimensions of each
embedding are weighted according to an optimization metric, which combines a
data separation term (given by LDA) and a ‘data fitting’ term (given by the
standard Principal Component Analysis cost function). For each experiment
reported below, we utilize the pLDA method to visualize a 2-dimensional
embedding of the testing data. In each case, a subset of the data was used
to estimate the lower dimensional embedding. The remaining (testing) data
was used to obtain the visualizations.

Algorithm 1: 5-fold cross validation
Partition the dataset into 5 folds. Leave one fold out for testing
and use the remaining folds for training.
foreach training set do
    1. For SVM and PLDA, partition the training set into 5 folds
       again. Leave one fold out for validation and use the remaining
       folds for training. For LDA, skip to step 2.
       foreach training set (parameter sweep) do
           a. Learn the classifier for different parameter values.
           b. Compute the validation error.
       Return the parameter with the best average validation error.
    2. Learn the classifier with the optimal parameter.
    3. Compute the testing error.
Compute the average classification error.

(a) PLDA projection in signal space (b) PLDA projection in CDT space
(horizontal axis: 1st discriminant direction)


The computational experiments shown in Sections 7.2, 7.4, 7.5, and 7.6 were
computed using a five-fold cross validation strategy, with 80% of the data
used for training, and 20% for testing. For the experiment in Section 7.3,
due to the small sample size, leave-one-out cross validation is used instead.
The experimental procedure is summarized in Algorithm 1. For more details on
cross validation experimental procedures, refer to the references.
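The two-level procedure can be sketched as follows. This is an illustrative
outline only: the experiments themselves were run in MATLAB, and the names
`make_folds`, `error_rate`, and `nested_cross_validation`, as well as the
`fit`/`predict` callback signatures, are assumptions made for this sketch.
The outer loop estimates the testing error, while the inner loop selects the
tuning parameter using the training portion only.

```python
import random

def make_folds(indices, k, seed):
    """Shuffle `indices` and deal them into k folds of near-equal size."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def error_rate(X, y, fit, predict, train, test, param):
    """Train on the `train` indices and return the error on the `test` indices."""
    model = fit([X[i] for i in train], [y[i] for i in train], param)
    predicted = predict(model, [X[i] for i in test])
    return sum(p != y[i] for p, i in zip(predicted, test)) / len(test)

def nested_cross_validation(X, y, fit, predict, params, k=5, seed=0):
    outer = make_folds(range(len(X)), k, seed)
    test_errors = []
    for f, test in enumerate(outer):
        train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        inner = make_folds(train, k, seed + 1)   # second split: training data only

        def validation_error(param):
            errors = []
            for h, val in enumerate(inner):
                sub = [i for g, fold in enumerate(inner) if g != h for i in fold]
                errors.append(error_rate(X, y, fit, predict, sub, val, param))
            return sum(errors) / len(errors)

        best = min(params, key=validation_error)  # parameter sweep
        test_errors.append(error_rate(X, y, fit, predict, train, test, best))
    return sum(test_errors) / len(test_errors)

# Toy usage: a 1-D threshold rule whose only parameter is the threshold.
X = list(range(20))
y = [int(v >= 10) for v in X]
fit = lambda X_train, y_train, threshold: threshold
predict = lambda threshold, X_test: [int(v > threshold) for v in X_test]
average_error = nested_cross_validation(X, y, fit, predict, params=[9.5, 5.0, 15.0])
```

A classifier without a tuning parameter (such as Fisher LDA) can be handled by
passing a single dummy value in `params`, and leave-one-out cross validation
corresponds to setting `k` equal to the number of samples in the outer split.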


7.2. Texture classification from intensity histograms

In this application, already discussed in the introduction as a motivating
example, our goal is to utilize the CDT to distinguish between two types of
texture images, under brightness and contrast variations, from their intensity
histograms. Consider the textures displayed in the middle rows of the
corresponding figure. Their histograms are shown directly under and above
each image, with variations in brightness and contrast. Variations in
brightness correspond to translations in the histograms, while variations in
contrast correspond to scalings (dilations) of the histograms. We note that
such variations (translation and scaling) satisfy the necessary properties
described in Theorem 5.6. Our theory thus predicts that the histogram data
would be perfectly separable in CDT space. To test this hypothesis, we
generated a set of 128 images (2 sets of 64 images) by applying 8 random
variations in brightness, with translations in the range [0, 0.5], and 8
random variations in contrast, with scalings in the range [0.6, 1.67]. The
results, reported in the corresponding table, confirm that the data 1) is not
linearly separable in histogram space and 2) becomes linearly separable in
CDT space. The lower dimensional representation of the original data using
Penalized LDA also confirms this (see the corresponding PLDA projection
figure).


7.3. Activity Recognition with Accelerometer Data

An accelerometer is a device that records the acceleration of a moving
object. Modern ‘smartphones’ are commonly equipped with a 3-axis
accelerometer that keeps track of the acceleration in 3 different directions
x, y, and z, and accelerometers have been widely adopted in various wearable
devices (e.g. watches) for human activity recognition. In this example, we
aim to detect (classify) two different activities given accelerometer data
obtained from an iPhone 5. Class 1 consists of a person swinging arms while
holding the phone. Class 2 consists of a phone being dropped to the ground.
Figure 7a shows the raw data recorded from the accelerometer for both
cases. We note that in this case, the signals varied in length given the
different durations of the episodes. Signals were zero-padded so that they
match the length of the longest signal. 10 sample signals were acquired for
each class. For each instance, the Energy = $x^2 + y^2 + z^2$ is computed
from the tri-axis measurements (see Figure 7b). Here we compare the
performance of linear classification in original (energy) signal space versus
in CDT space.
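The preprocessing described above (energy computation and zero-padding) can be
sketched as follows. The helper names `energy` and `zero_pad` and the sample
values are hypothetical and only serve as an illustration; since the energy of
an all-zero sample is zero, padding before or after the energy computation
gives the same result.

```python
def energy(x, y, z):
    """Per-sample energy x^2 + y^2 + z^2 of a tri-axis recording."""
    return [a * a + b * b + c * c for a, b, c in zip(x, y, z)]

def zero_pad(signals):
    """Append zeros so that every signal matches the longest one."""
    longest = max(len(s) for s in signals)
    return [list(s) + [0.0] * (longest - len(s)) for s in signals]

# Two made-up recordings of different duration.
short = energy([1.0, 2.0], [2.0, 0.0], [2.0, 1.0])
longer = energy([0.0, 1.0, 0.0], [0.0, 1.0, 3.0], [1.0, 1.0, 4.0])
padded = zero_pad([short, longer])
```

Before applying the CDT, each padded energy signal would additionally be
scaled to sum to one so that it can be interpreted as a probability density.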

Results are shown in Table 2 and clearly indicate that the data becomes
linearly separable in CDT space. The lower dimensional representation of
the original data using Penalized LDA (PLDA) [39] indicates (see Figure 8)
that each class forms convex hulls that are linearly separable in CDT space
but not in energy signal space. For this example, both training and testing
data are represented in the lower dimensional embedding in Figure 8. From
Figure 8 we can verify that the linear classifier computed using only the
training set correctly separates both the training and testing sets in CDT
space, but not in energy signal space.

In this experiment, it is apparent that the signals varied in terms of
intensity and the location where the maximum peak occurred, and this
explains the inability of linear classifiers to perform well in original
(energy) signal space. As explained above, the CDT is able to overcome such
variations.

Figure 7: Two classes of accelerometer dataset, swinging (top row) vs. free
falling (bottom row)

(a) PLDA projection in signal space (b) PLDA projection in CDT space
Figure 8: PLDA projection for accelerometer dataset


Classifier type   Dataset        Signal space   CDT space
Fisher LDA        Training set   0%             0%
                  Testing set    50%            10%
PLDA              Training set   0%             0%
                  Testing set    60%            5%
Linear SVM        Training set   8.75%          7.5%
                  Testing set    55%            10%

Table 2: Average classification error of the accelerometer dataset


7.4. Flow Cytometry

Flow Cytometry is a technique used to analyze light emission properties
of grouped cells using fluorescence markers. In this example, we utilize
an existing database (the FlowRepository database) to distinguish
DNA histograms between normal subjects and donors diagnosed with acute
myeloid leukemia (AML), obtained from peripheral blood or bone marrow
aspirates. The data included 8 measurements per subject, where fluorochrome
signals were detected specifically at the 620 nm wavelength. Sample data is
shown in Figure 9a, where the x-axis represents each cell that passed
through the flow cytometry sensor, and the y-axis corresponds to the DNA
intensity measurement of the cell at wavelength 620 nm. Intensity histograms
with 1024 intensity levels and their corresponding CDTs are then computed
(see Figure 9b and Figure 9c).

(a) Raw data (b) Histogram (c) CDT
Figure 9: Two classes of flow cytometry data, AML (top row) vs. Normal (bottom row)

(a) PLDA projection in signal space (b) PLDA projection in CDT space
Figure 10: PLDA projection for flow cytometry dataset

Classifier type   Dataset        Signal space   CDT space
Fisher LDA        Training set   6.81%          5.82%
                  Testing set    15.01%         11.31%
PLDA              Training set   11.55%         7.75%
                  Testing set    12.03%         9.15%
Linear SVM        Training set   10.39%         8.65%
                  Testing set    11.46%         8.88%

Table 3: Average classification error of the flow cytometry dataset


The average classification error is reported in Table 3. We note that
classification in the signal space using LDA (test accuracy of 84.99%)
is worse than the line of chance (87.5%), given the uneven distribution of
patient data. Comparison with the line of chance also suggests that the
linear classifiers trained in histogram space using PLDA or SVM are more or
less equivalent to random classification. However, classifying data in CDT
space suggests that linear separation is possible, and the Cohen’s Kappa for
this computation (0.3) confirms fair agreement [1].
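For readers unfamiliar with the two quantities used in this comparison, the
following sketch shows how a line of chance (here taken to be the accuracy of
always predicting the majority class, which is our reading of the term) and
Cohen's Kappa can be computed. The confusion matrix and class counts are made
up for illustration and are not the values from this experiment.

```python
def chance_line(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(class_counts) / sum(class_counts)

def cohens_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (rows: true, columns: predicted)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)

# Hypothetical numbers, for illustration only.
baseline = chance_line([700, 100])
kappa = cohens_kappa([[20, 5], [10, 15]])
```

Kappa corrects the observed accuracy for the agreement expected by chance,
which makes it more informative than raw accuracy when classes are unbalanced.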

7.5. Cambridge Hand Dataset

The Cambridge hand gesture dataset consists of 900 image sequences of
3 primitive hand shapes (see Figure 11), where each image sequence consists
of around 60 frames of 3 different motions [23]. In this example, we try to
distinguish 3 different hand shapes: flat, spread, and v-shape. There are
2678 images for flat hands, 2992 images for spread hands, and 2764 images
for v-shape hands, and each image was taken under arbitrary positioning and
illumination. A preprocessing step computes the edges of each image (240
x 320 pixels) and the corresponding indices of the edge pixels. Two
histograms are created per image by counting the x coordinates and y
coordinates of the edge pixels (Figure 11b). The corresponding CDTs are
computed for each x and y histogram (Figure 11c). Classification is performed
on the concatenation of the x and y histograms, and on the concatenation of
the x and y CDTs.
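The histogram construction described above can be sketched as follows. The
edge detector itself is outside the scope of this sketch, the function names
are hypothetical, and treating 240 as the number of rows and 320 as the number
of columns is an assumption.

```python
def coordinate_histograms(edge_pixels, height=240, width=320):
    """Count edge pixels per column (x histogram) and per row (y histogram).

    `edge_pixels` is an iterable of (row, col) indices of detected edges.
    """
    hist_x = [0] * width
    hist_y = [0] * height
    for row, col in edge_pixels:
        hist_x[col] += 1
        hist_y[row] += 1
    return hist_x, hist_y

def concatenate(first, second):
    """Feature vector used for classification: x part followed by y part."""
    return list(first) + list(second)

# Three made-up edge pixels on a tiny 3 x 4 image.
hist_x, hist_y = coordinate_histograms([(0, 1), (2, 1), (2, 3)], height=3, width=4)
features = concatenate(hist_x, hist_y)
```

The same `concatenate` step applies to the two CDTs once each histogram has
been normalized and transformed.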


Classifier type   Dataset        Signal space   CDT space
Fisher LDA        Training set   13.92%         4.58%
                  Testing set    16.11%         5.76%
PLDA              Training set   38.02%         6.73%
                  Testing set    38.21%         6.97%
Linear SVM        Training set   13.77%         1.27%
                  Testing set    15.73%         1.65%

Table 4: Average classification error of the hand gestures dataset

(a) Raw data (b) Histogram (c) CDT
Figure 11: Three different classes of hand gestures dataset

(a) PLDA projection in signal space (b) PLDA projection in CDT space
Figure 12: PLDA projection for hand gesture dataset




Results are shown in Table 4, and clearly indicate that the data becomes
more linearly separable in CDT space. As in previous examples,
the two dimensional representation of the original testing data using
Penalized LDA (PLDA) [39] indicates (see Figure 12) that classes form convex
hulls that are linearly separable in CDT space and not in histogram space.
Moreover, this example shows that the CDT can also be applied to multi-class
problems, where it can simplify the classification task.


7.6. Actin and Microtubules Classification

Our goal in this experiment is to quantify how well actin and microtubule
filaments in HeLa cells [5] differ from one another in terms of their
orientation distributions. Fluorescence microscope images of HeLa cells were
grouped into two classes according to their protein structure:
rhodamine-conjugated phalloidin, which labels F-actin, and a monoclonal
antibody against beta-tubulin (microtubules). Each image was pre-processed
such that the area outside the cropped region was set to 0 and the image was
contrast-stretched to full scale (see Figure 13a). In order to compute the
orientation of each pixel, the images were filtered with 32 Gabor filters of
size 9x9, and for each pixel, the filter with the maximum response is
selected and labeled from 1 to 32 (see Figure 13b). A histogram of
orientation filter responses is computed for each image (see Figure 13c) and
then the CDT is computed for each histogram (see Figure 13d). In this
example, both histogram and CDT show excellent classification accuracy, even
though the differences between the two protein structures are hard to
recognize by visual inspection. This is an instance where data is already
well (linearly) separated in Euclidean space, and is also linearly separable
in CDT space (i.e. the CDT did not destroy linear separation in this
example).

(a) Raw data (b) Filter Response (c) Histogram (d) CDT
Figure 13: Two classes of HeLa dataset, Actin (top row) vs. Microtubules (bottom row)

(a) PLDA projection in signal space (b) PLDA projection in CDT space
Figure 14: PLDA projection for HeLa dataset

Classifier type   Dataset        Signal space   CDT space
Fisher LDA        Training set   0.53%          0.40%
                  Testing set    2.66%          2.59%
PLDA              Training set   0.14%          0.92%
                  Testing set    1.59%          1.07%
Linear SVM        Training set   0%             0.26%
                  Testing set    0.53%          1.05%

Table 5: Average classification error of the HeLa dataset


8. Discussion and conclusions 

In this paper we have described a new nonlinear operation, termed the
Cumulative Distribution Transform (CDT), that takes as input signals that
can be understood as probability density functions, and outputs a continuous
function that is related to morphing that signal to a chosen reference
signal. We have shown that, irrespective of the reference choice, the
transform is useful for converting signal variations that are ‘Lagrangian’ in
the sense of displacing (transporting) intensities throughout the signal
space, to operations that are ‘Eulerian’ in the sense that they become
(simpler) adjustments of the intensities, without transporting them, in
transform space. This conversion is the basis for our main Theorem 5.6 that
states the necessary conditions for the CDT to make signal classes linearly
separable in transform space. In addition to describing a few of its
properties, we have experimentally studied the ability of the CDT to improve
linear separability, in comparison to that in original signal space, in five
diverse applications involving both simulated and real data. In all examples
shown, the results of Theorem 5.6 are confirmed.

The CDT is cheap to compute. Above we described a numerical approximation
for discrete signals that is $O(N)$, with $N$ the length of the signal. Its
computational efficiency, combined with the theoretical and experimental
results presented above, suggests that the CDT could be a useful tool for
building more complex signal pattern recognition systems in a variety of
applications. We emphasize that we envision the CDT to be a useful
pre-processing step in the search for solutions to complex problems. Once
data is transformed into CDT space, other techniques (e.g. Fourier and
Wavelet transforms, Haralick features, etc.) can be applied to the
transformed data as well. The fact that the CDT is a mathematically
invertible transform ensures that no information will be lost in this step.

The CDT can be compared to the ‘feature map’ in kernel methods,
considering that the CDT maps the raw data into the CDT space (analogous
to the feature space). However, kernel methods avoid explicit formulae for
the feature maps, and rather apply a kernel function to the raw data. Hence,
the data is rarely analyzed in the feature space in kernel methods. On the
other hand, the CDT has the advantage of directly utilizing the data in the
CDT space (feature space). For example, we can invert the linear classifier
in the CDT space to that in the raw data space, and by looking at the
latter, we can examine which signal characteristics distinguish two classes
of signals. Moreover, kernel methods require the data to be a vector, i.e. a
discrete sequence. Many natural signals, such as physiological signals, are
however continuous in nature. We emphasize that the CDT is better suited to
continuous signals than kernel methods.

The main limitation of the CDT as a ‘feature extraction’ method is
that, as presented, it can only be applied to signals that can be interpreted
as probability density functions (hence positive signals). We note, however,
that this is not an impediment to its application in a wide variety of data
that naturally satisfy the positivity constraint (normalization to a density
function can be achieved by a scaling factor). Examples of problems that
involve naturally positive data include commodity (e.g. stock) prices [52],
photon counting devices (e.g. image pixel intensities) [32], cell counting
devices (e.g. flow cytometry), analysis of fMRI signals [50], analysis of
frequency densities (e.g. Fourier descriptors) [42], orientation filters,
3D shape or patch-histograms [12], spectral densities, and many others.
Pattern recognition systems for such applications normally consist
of a feature extraction step, and a statistical pattern analysis (e.g.
classification) step. In situations where the data being analyzed is
naturally positive, the CDT could be used as a step in this pipeline that
could simplify (and enhance the performance of) subsequent feature extraction
and classification.

Yet another limitation of the CDT model, as stated in Theorem 5.6, is
that the linear separability properties depend on the signals being generated
from mother signals through the application of a differentiable, one-to-one
monotonic function with additional restrictions. In certain cases, a physical
model for the data can help determine whether the conditions for linear
separability in CDT space are applicable. This is the case for the application
involving texture discrimination under brightness and contrast variations
shown in the introduction. For other applications (e.g. cancer detection
from flow cytometry data), however, we have no underlying physical model
to determine whether the necessary conditions for linear separability are
met. In some cases, the CDT can indeed be a poor match for the problem.
One example would be signal/image classification using texton histograms.
The reason is that the independent variable of these signals has
no specific order, and thus the meaning of derivatives with respect to the
chosen independent variable ordering is not clear. As such, the model we
utilize for the signal classes, which depends on the application of a smooth
function to a ‘mother signal’, does not quite apply. In such cases, the CDT
can still be applied, though we currently offer no information regarding
whether the CDT would enhance (or help destroy) linear separability. The
variety of examples shown above, however, has helped us confirm that the
model is applicable, at least to some extent, to a non-negligible number of
applications.

Finally, the work presented here is preliminary, and it could be useful to
expand it in several directions. One natural direction would be to utilize a
similar technique (conversion from ‘Lagrangian’ to ‘Eulerian’ point of view)
to simplify pattern recognition for a broader class of signals, including
signals that can take negative values, as well as 2-dimensional and
3-dimensional images. Yet another direction to follow is to study whether the
CDT formalism has any benefit in sampling and signal estimation problems.
These, and other extensions, will be the subject of future work.

Appendix A. Proof for translation property

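Consider a probability density $I_1 : [y_1, y_2] \to \mathbb{R}$, and let
$I_\mu : [y_1 + \mu, y_2 + \mu] \to \mathbb{R}$ represent a translation of
the probability density $I_1$ by $\mu$, i.e. $I_\mu(x) = I_1(x - \mu)$. To
find the CDT for $I_\mu$ with respect to the reference probability density
$I_0 : X \to \mathbb{R}$, we solve for
$f_\mu : X \to [y_1 + \mu, y_2 + \mu]$:

$$\int_{y_1+\mu}^{f_\mu(x)} I_\mu(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau. \qquad (A.1)$$

And similarly, to find the CDT for $I_1$ with respect to the reference $I_0$,
we solve for $f_1 : X \to [y_1, y_2]$:

$$\int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau. \qquad (A.2)$$

(A.1) and (A.2) can be set equal:

$$\int_{y_1+\mu}^{f_\mu(x)} I_\mu(\tau)\,d\tau = \int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau. \qquad (A.3)$$

By substituting $I_\mu(\tau) = I_1(\tau - \mu)$ in (A.3), we have

$$\int_{y_1+\mu}^{f_\mu(x)} I_1(\tau - \mu)\,d\tau = \int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau. \qquad (A.4)$$

By the change of variables theorem, we can substitute $u = \tau - \mu$ in (A.4):

$$\int_{y_1}^{f_\mu(x) - \mu} I_1(u)\,du = \int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau.$$

Since the upper limits of the integrals on the left and right sides are equal,
we have $f_\mu(x) = f_1(x) + \mu$. Substituting this into the expression
$\hat{I}_\mu(x) = (f_\mu(x) - x)\sqrt{I_0(x)}$, we have

$$\hat{I}_\mu(x) = (f_1(x) + \mu - x)\sqrt{I_0(x)}.$$

By substituting $\hat{I}_1(x) = (f_1(x) - x)\sqrt{I_0(x)}$, we have proved the
translation property

$$\hat{I}_\mu(x) = \hat{I}_1(x) + \mu\sqrt{I_0(x)}.$$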


Appendix B. Proof for scaling property

Consider a probability density $I_1 : [y_1, y_2] \to \mathbb{R}$, and let
$I_a : [y_1/a, y_2/a] \to \mathbb{R}$ represent a scaling of the probability
density $I_1$ by $a$, i.e. $I_a(x) = a I_1(ax)$. To find the CDT for $I_a$
with respect to the reference $I_0 : X \to \mathbb{R}$, we solve for
$f_a : X \to [y_1/a, y_2/a]$:

$$\int_{y_1/a}^{f_a(x)} I_a(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau. \qquad (B.1)$$

And similarly, to find the CDT for $I_1$ with respect to the reference $I_0$,
we solve for $f_1 : X \to [y_1, y_2]$:

$$\int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau. \qquad (B.2)$$

(B.1) and (B.2) can be set equal:

$$\int_{y_1/a}^{f_a(x)} I_a(\tau)\,d\tau = \int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau. \qquad (B.3)$$

By substituting $I_a(\tau) = a I_1(a\tau)$ in (B.3), we have

$$\int_{y_1/a}^{f_a(x)} a I_1(a\tau)\,d\tau = \int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau. \qquad (B.4)$$

By the change of variables theorem we can substitute $a\tau = u$,
$a\,d\tau = du$ in (B.4):

$$\int_{y_1}^{a f_a(x)} I_1(u)\,du = \int_{y_1}^{f_1(x)} I_1(\tau)\,d\tau.$$

Since the upper limits of the integrals on the left and right sides are equal,
we have $f_a(x) = f_1(x)/a$. Substituting this into the expression
$\hat{I}_a(x) = (f_a(x) - x)\sqrt{I_0(x)}$, and after some algebra, we get
$\hat{I}_a : X \to \mathbb{R}$:

$$\hat{I}_a(x) = \frac{\hat{I}_1(x) - x(a - 1)\sqrt{I_0(x)}}{a}.$$
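The translation and scaling properties can be checked numerically on a simple
example. The sketch below is an illustration under stated assumptions, not
part of the proofs: it takes a uniform reference $I_0$ on $[0, 1]$ (so
$\sqrt{I_0} = 1$), the test density $I_1(y) = 2y$ on $[0, 1]$, and inverts
each cumulative distribution by bisection. The function names and the values
of $\mu$ and $a$ are arbitrary choices.

```python
def inverse_cdf(cdf, lo, hi, x, iterations=60):
    """Solve cdf(f) = x for f in [lo, hi] by bisection (cdf is increasing)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cdt(cdf, lo, hi, grid):
    # Uniform reference on [0, 1]: the reference CDF is x and sqrt(I_0) = 1,
    # so the CDT reduces to f(x) - x.
    return [inverse_cdf(cdf, lo, hi, x) - x for x in grid]

mu, a = 2.0, 3.0
grid = [0.1, 0.25, 0.5, 0.75, 0.9]

base = cdt(lambda y: y * y, 0.0, 1.0, grid)                 # I_1(y) = 2y
shifted = cdt(lambda y: (y - mu) ** 2, mu, 1.0 + mu, grid)  # I_mu(y) = I_1(y - mu)
scaled = cdt(lambda y: (a * y) ** 2, 0.0, 1.0 / a, grid)    # I_a(y) = a I_1(a y)

# Largest deviation from the two closed-form properties.
translation_gap = max(abs(s - (b + mu)) for s, b in zip(shifted, base))
scaling_gap = max(
    abs(s - (b - x * (a - 1.0)) / a) for s, b, x in zip(scaled, base, grid)
)
```

Both gaps should be at the level of floating-point round-off, since the
transforms of the translated and scaled densities are computed independently
of the formulas they are compared against.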


Appendix C. Proof for composition property

Let $I_1 : Y \to \mathbb{R}$ represent a probability density, and
$J_1 : Y \to \mathbb{R}$ its cumulative distribution function. Let
$I_g : Z \to \mathbb{R}$ represent a probability density that has the
following relation with $J_1$:

$$J_g(x) = J_1(g(x)), \qquad (C.1)$$

where $J_g : Z \to \mathbb{R}$ represents the corresponding cumulative
distribution for $I_g$, and $g : Z \to Y$ is an invertible, differentiable
function. By differentiating each side of (C.1), we have

$$I_g(x) = g'(x) I_1(g(x)).$$

To find the CDT for $I_g$ with respect to the reference probability density
$I_0 : X \to \mathbb{R}$, we solve for $f_g : X \to Z$:

$$\int_{\inf(Z)}^{f_g(x)} I_g(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau. \qquad (C.2)$$

And similarly, to find the CDT for $I_1$, we solve for $f_1 : X \to Y$:

$$\int_{\inf(Y)}^{f_1(x)} I_1(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau. \qquad (C.3)$$

(C.2) and (C.3) can be set equal:

$$\int_{\inf(Z)}^{f_g(x)} I_g(\tau)\,d\tau = \int_{\inf(Y)}^{f_1(x)} I_1(\tau)\,d\tau. \qquad (C.4)$$

By substituting $I_g(x) = g'(x) I_1(g(x))$ in (C.4), we have

$$\int_{\inf(Z)}^{f_g(x)} g'(\tau) I_1(g(\tau))\,d\tau = \int_{\inf(Y)}^{f_1(x)} I_1(\tau)\,d\tau. \qquad (C.5)$$

By the change of variables theorem we can substitute $g(\tau) = u$,
$g'(\tau)\,d\tau = du$ in (C.5):

$$\int_{\inf(Y)}^{g(f_g(x))} I_1(u)\,du = \int_{\inf(Y)}^{f_1(x)} I_1(\tau)\,d\tau.$$

Since the upper limits of the integrals on the left and right sides are equal,
we have

$$g(f_g(x)) = f_1(x).$$

Since $g$ is an invertible function, $f_g(x) = g^{-1}(f_1(x))$ holds. By
substituting this into the expression
$\hat{I}_g(x) = (f_g(x) - x)\sqrt{I_0(x)}$, and after some algebra, we get
$\hat{I}_g : X \to \mathbb{R}$:

$$\hat{I}_g(x) = \left(g^{-1}(f_1(x)) - x\right)\sqrt{I_0(x)}.$$

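Appendix D. Proof for Lemma 5.4

Proof. (if) The convex hulls of compact convex sets are compact.
Therefore, the convex hulls are compact. For disjoint, compact convex
sets, we know from Lemma 5.3 that there exists a hyperplane that linearly
separates the two. Therefore, if the convex hulls are disjoint (i.e. the
condition of Lemma 5.4 holds), then $P$ and $Q$ are linearly separable.

(only if) Suppose $P$ and $Q$ are linearly separable but the convex hulls
of $P$ and $Q$ are not disjoint, i.e. there exist
$\{p_i\}_{i=1}^{N_p} \subset P$, $\{q_j\}_{j=1}^{N_q} \subset Q$, and
$\alpha_i, \beta_j > 0$ that satisfy
$\sum_{i=1}^{N_p} \alpha_i = \sum_{j=1}^{N_q} \beta_j = 1$ such that

$$\sum_{i=1}^{N_p} \alpha_i p_i = \sum_{j=1}^{N_q} \beta_j q_j \qquad (D.1)$$

for finite $N_p$, $N_q$. We can easily see that this contradicts linear
separability. Suppose there exists a linear classifier (i.e. $w(x)$ and $b$)
that satisfies $\int_X w(x) p(x)\,dx < b$ for all $p \in P$ and
$\int_X w(x) q(x)\,dx > b$ for all $q \in Q$. By multiplying each side of
(D.1) with $w(x)$ and integrating over $X$, we have

$$\int_X w(x) \left( \sum_{i=1}^{N_p} \alpha_i p_i(x) \right) dx = \int_X w(x) \left( \sum_{j=1}^{N_q} \beta_j q_j(x) \right) dx. \qquad (D.2)$$

The left side of (D.2) is always smaller than $b$ because

$$\int_X w(x) \left( \sum_{i=1}^{N_p} \alpha_i p_i(x) \right) dx = \sum_{i=1}^{N_p} \alpha_i \int_X w(x) p_i(x)\,dx < \sum_{i=1}^{N_p} \alpha_i b = b. \qquad (D.3)$$

On the other hand, the right side of (D.2) is always larger than $b$ because

$$\int_X w(x) \left( \sum_{j=1}^{N_q} \beta_j q_j(x) \right) dx = \sum_{j=1}^{N_q} \beta_j \int_X w(x) q_j(x)\,dx > \sum_{j=1}^{N_q} \beta_j b = b. \qquad (D.4)$$

However, (D.3) and (D.4) contradict the equality in (D.2), which
implies that the linear classifier $w$ cannot exist. Therefore, the convex
hulls must be disjoint if a linear classifier exists. □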

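Appendix E. Proof for Theorem 5.6

Proof. We show that $\hat{P}$, $\hat{Q}$ must be linearly separable. If not,
it would contradict Definition 5.5 that $P$ and $Q$ are disjoint. Suppose
$\hat{P}$, $\hat{Q}$ are not linearly separable. Then by Lemma 5.4 there
exist $\{\hat{p}_i\}_{i=1}^{N_p} \subset \hat{P}$,
$\{\hat{q}_j\}_{j=1}^{N_q} \subset \hat{Q}$, and $\alpha_i, \beta_j > 0$ that
satisfy $\sum_{i=1}^{N_p} \alpha_i = \sum_{j=1}^{N_q} \beta_j = 1$ such that
the convex combinations of $\{\hat{p}_i\}_{i=1}^{N_p}$ and
$\{\hat{q}_j\}_{j=1}^{N_q}$ are equal, i.e.

$$\sum_{i=1}^{N_p} \alpha_i \hat{p}_i = \sum_{j=1}^{N_q} \beta_j \hat{q}_j.$$

By substituting $\hat{p}_i = (f_i - \mathbb{1})\sqrt{I_0}$ and
$\hat{q}_j = (g_j - \mathbb{1})\sqrt{I_0}$, where $\mathbb{1}$ refers to the
identity map, we have

$$\sum_{i=1}^{N_p} \alpha_i (f_i - \mathbb{1})\sqrt{I_0} = \sum_{j=1}^{N_q} \beta_j (g_j - \mathbb{1})\sqrt{I_0}.$$

By using $\sum_{i=1}^{N_p} \alpha_i = \sum_{j=1}^{N_q} \beta_j = 1$ and
dividing each side of the equation by $\sqrt{I_0}$, we have

$$\sum_{i=1}^{N_p} \alpha_i f_i = \sum_{j=1}^{N_q} \beta_j g_j.$$

By substituting $f_i = h_i^{-1} \circ f_0$ and $g_j = h_j^{-1} \circ g_0$
(see Lemma E.1 presented below), we have

$$\sum_{i=1}^{N_p} \alpha_i (h_i^{-1} \circ f_0) = \sum_{j=1}^{N_q} \beta_j (h_j^{-1} \circ g_0).$$

By substituting $h_\alpha^{-1} = \sum_{i=1}^{N_p} \alpha_i h_i^{-1}$ and
$h_\beta^{-1} = \sum_{j=1}^{N_q} \beta_j h_j^{-1}$, we have

$$h_\alpha^{-1} \circ f_0 = h_\beta^{-1} \circ g_0.$$

By composing each side of the equation with $h_\alpha$, we have

$$f_0 = h_\alpha \circ h_\beta^{-1} \circ g_0. \qquad (E.1)$$

Note that $h_\alpha^{-1}, h_\alpha, h_\alpha \circ h_\beta^{-1} \in \mathbb{H}$
by conditions i), ii), iii). From the definition of the CDT with respect to
the reference $I_0$, we have

$$f_0' (p_0 \circ f_0) = g_0' (q_0 \circ g_0) = I_0.$$

By substituting $f_0$ with the right side of (E.1), we have

$$(h_\alpha \circ h_\beta^{-1} \circ g_0)' \left(p_0 \circ (h_\alpha \circ h_\beta^{-1} \circ g_0)\right) = g_0' (q_0 \circ g_0)$$
$$(h_\alpha \circ h_\beta^{-1})' \left(p_0 \circ (h_\alpha \circ h_\beta^{-1})\right) = q_0$$
$$\Rightarrow \; h_{\alpha\beta^{-1}}' \, p_0(h_{\alpha\beta^{-1}}) = q_0.$$

The last step is derived by setting
$h_{\alpha\beta^{-1}} = h_\alpha \circ h_\beta^{-1}$, where
$h_{\alpha\beta^{-1}} \in \mathbb{H}$. However, the last statement
contradicts Definition 5.5, in which $h' p_0(h)$ and $h' q_0(h)$ belong to
the disjoint sets $P$ and $Q$, respectively. Therefore, $\hat{P}$, $\hat{Q}$
must be linearly separable. □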


Lemma E.1. Let $f_0$, $f_1$ be monotonic functions from $X \to Y$ for
probability densities $p_0 : Y \to \mathbb{R}$, $p_1 : Y \to \mathbb{R}$ with
respect to the reference $I_0 : X \to \mathbb{R}$, such that

$$\int_{\inf(Y)}^{f_0(x)} p_0(\tau)\,d\tau = \int_{\inf(Y)}^{f_1(x)} p_1(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau. \qquad (E.2)$$

Then $p_1 = h_1' (p_0 \circ h_1)$ implies $h_1 \circ f_1 = f_0$.

Proof. Substituting $p_1 = h_1' p_0(h_1)$ in (E.2), we have

$$\int_{\inf(Y)}^{f_1(x)} h_1'(\tau) p_0(h_1(\tau))\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau.$$

By the change of variables theorem, substituting $h_1(\tau) = u$ and
$h_1'(\tau)\,d\tau = du$, we have

$$\int_{\inf(Y)}^{h_1(f_1(x))} p_0(u)\,du = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau.$$

Since $\int_{\inf(Y)}^{f_0(x)} p_0(\tau)\,d\tau = \int_{\inf(X)}^{x} I_0(\tau)\,d\tau$
holds (see (E.2)), we have

$$\int_{\inf(Y)}^{h_1(f_1(x))} p_0(\tau)\,d\tau = \int_{\inf(Y)}^{f_0(x)} p_0(\tau)\,d\tau. \qquad (E.3)$$

The upper limits on each side of (E.3) can be set equal since both $f_1$ and
$h_1$ are strictly increasing functions:

$$h_1(f_1(x)) = f_0(x). \qquad (E.4)$$

Equivalently, we have $f_1(x) = h_1^{-1}(f_0(x))$ by inverting (E.4). □